{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_nporece",
    "id": "1054EE94F8104E6A80A156562A92897A",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# 语言模型\n",
    "\n",
    "一段自然语言文本可以看作是一个离散时间序列，给定一个长度为$T$的词的序列$w_1, w_2, \\ldots, w_T$，语言模型的目标就是评估该序列是否合理，即计算该序列的概率：\n",
    "\n",
    "\n",
    "$$\n",
    "P(w_1, w_2, \\ldots, w_T).\n",
    "$$\n",
    "\n",
    "\n",
    "本节我们介绍基于统计的语言模型，主要是$n$元语法（$n$-gram）。在后续内容中，我们将会介绍基于神经网络的语言模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_j2098vn",
    "id": "A05A3A43703846CDBCD2576B7BA3E21B",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 语言模型\n",
    "\n",
    "\n",
    "假设序列$w_1, w_2, \\ldots, w_T$中的每个词是依次生成的，我们有\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "P(w_1, w_2, \\ldots, w_T)\n",
    "&= \\prod_{t=1}^T P(w_t \\mid w_1, \\ldots, w_{t-1})\\\\\n",
    "&= P(w_1)P(w_2 \\mid w_1) \\cdots P(w_T \\mid w_1w_2\\cdots w_{T-1})\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "\n",
    "例如，一段含有4个词的文本序列的概率\n",
    "\n",
    "\n",
    "$$\n",
    "P(w_1, w_2, w_3, w_4) =  P(w_1) P(w_2 \\mid w_1) P(w_3 \\mid w_1, w_2) P(w_4 \\mid w_1, w_2, w_3).\n",
    "$$\n",
    "\n",
    "\n",
    "语言模型的参数就是词的概率以及给定前几个词情况下的条件概率。设训练数据集为一个大型文本语料库，如维基百科的所有条目，词的概率可以通过该词在训练数据集中的相对词频来计算，例如，$w_1$的概率可以计算为：\n",
    "\n",
    "\n",
    "$$\n",
    "\\hat P(w_1) = \\frac{n(w_1)}{n}\n",
    "$$\n",
    "\n",
    "\n",
    "其中$n(w_1)$为语料库中以$w_1$的出现次数，$n$为语料库中文本的总数量。\n",
    "\n",
    "类似的，给定$w_1$情况下，$w_2$的条件概率可以计算为：\n",
    "\n",
    "\n",
    "$$\n",
    "\\hat P(w_2 \\mid w_1) = \\frac{n(w_1, w_2)}{n(w_1)}\n",
    "$$\n",
    "\n",
    "\n",
    "其中$n(w_1, w_2)$为语料库中以$w_1$作为第一个词，$w_2$作为第二个词的文本的数量。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_6801mjh",
    "id": "251399EF3DD149548D760A5F24266338",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## n元语法\n",
    "\n",
    "序列长度增加，计算和存储多个词共同出现的概率的复杂度会呈指数级增加。$n$元语法通过马尔可夫假设简化模型，马尔科夫假设是指一个词的出现只与前面$n$个词相关，即$n$阶马尔可夫链（Markov chain of order $n$），如果$n=1$，那么有$P(w_3 \\mid w_1, w_2) = P(w_3 \\mid w_2)$。基于$n-1$阶马尔可夫链，我们可以将语言模型改写为\n",
    "\n",
    "\n",
    "$$\n",
    "P(w_1, w_2, \\ldots, w_T) = \\prod_{t=1}^T P(w_t \\mid w_{t-(n-1)}, \\ldots, w_{t-1}) .\n",
    "$$\n",
    "\n",
    "\n",
    "以上也叫$n$元语法（$n$-grams），它是基于$n - 1$阶马尔可夫链的概率语言模型。例如，当$n=2$时，含有4个词的文本序列的概率就可以改写为：\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "P(w_1, w_2, w_3, w_4)\n",
    "&= P(w_1) P(w_2 \\mid w_1) P(w_3 \\mid w_1, w_2) P(w_4 \\mid w_1, w_2, w_3)\\\\\n",
    "&= P(w_1) P(w_2 \\mid w_1) P(w_3 \\mid w_2) P(w_4 \\mid w_3)\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "\n",
    "当$n$分别为1、2和3时，我们将其分别称作一元语法（unigram）、二元语法（bigram）和三元语法（trigram）。例如，长度为4的序列$w_1, w_2, w_3, w_4$在一元语法、二元语法和三元语法中的概率分别为\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2) P(w_3) P(w_4) ,\\\\\n",
    "P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2 \\mid w_1) P(w_3 \\mid w_2) P(w_4 \\mid w_3) ,\\\\\n",
    "P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2 \\mid w_1) P(w_3 \\mid w_1, w_2) P(w_4 \\mid w_2, w_3) .\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "\n",
    "当$n$较小时，$n$元语法往往并不准确。例如，在一元语法中，由三个词组成的句子“你走先”和“你先走”的概率是一样的。然而，当$n$较大时，$n$元语法需要计算并存储大量的词频和多词相邻频率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_3wfv5oo",
    "id": "CFDDB4CFC6CA4F558608DA01CD07F9A0",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "思考：$n$元语法可能有哪些缺陷？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "E0358FB3352B4843883573902E9AEFDF",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "1. 参数空间过大\n",
    "2. 数据稀疏"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_cpdv382",
    "id": "855C64A9C75A4C6C88134AC099FCD6D6",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# 语言模型数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_0419ugr",
    "id": "131B4EA95293494D8D7E089964514961",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 读取数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "graffitiCellId": "id_210oytz",
    "id": "434D33D454FB42C38C6F9816E9E1FBF4",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "63282\n",
      "想要有直升机\n",
      "想要和你飞到宇宙去\n",
      "想要和你融化在一起\n",
      "融化在宇宙里\n",
      "我每天每天每\n"
     ]
    }
   ],
   "source": [
    "with open('/home/kesci/input/jaychou9972/jaychou_lyrics.txt') as f:\n",
    "    corpus_chars = f.read()\n",
    "print(len(corpus_chars))\n",
    "print(corpus_chars[: 40])\n",
    "corpus_chars = corpus_chars.replace('\\n', ' ').replace('\\r', ' ')\n",
    "corpus_chars = corpus_chars[: 10000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_ls5ngie",
    "id": "FAC600069D7A4E8E84F4E9D8C1232B27",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 建立字符索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "graffitiCellId": "id_rpxalgy",
    "id": "B76CFE1C0F69419AA0019EC506E6A7EB",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1027\n",
      "chars: 想要有直升机 想要和你飞到宇宙去 想要和\n",
      "indices: [1014, 18, 805, 651, 374, 192, 838, 1014, 18, 1006, 714, 63, 639, 1011, 576, 256, 838, 1014, 18, 1006]\n"
     ]
    }
   ],
   "source": [
    "idx_to_char = list(set(corpus_chars)) # 去重，得到索引到字符的映射\n",
    "char_to_idx = {char: i for i, char in enumerate(idx_to_char)} # 字符到索引的映射\n",
    "vocab_size = len(char_to_idx)\n",
    "print(vocab_size)\n",
    "\n",
    "corpus_indices = [char_to_idx[char] for char in corpus_chars]  # 将每个字符转化为索引，得到一个索引的序列\n",
    "sample = corpus_indices[: 20]\n",
    "print('chars:', ''.join([idx_to_char[idx] for idx in sample]))\n",
    "print('indices:', sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_4k8xe1z",
    "id": "AEC7A5F1A73E4E5B8B5C87B28A82E553",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "定义函数`load_data_jay_lyrics`，在后续章节中直接调用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "graffitiCellId": "id_g98p1e0",
    "id": "82B3F014ECB14F099961B196CF98D42F",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def load_data_jay_lyrics():\n",
    "    with open('/home/kesci/input/jaychou9972/jaychou_lyrics.txt') as f:\n",
    "        corpus_chars = f.read()\n",
    "    corpus_chars = corpus_chars.replace('\\n', ' ').replace('\\r', ' ')\n",
    "    corpus_chars = corpus_chars[0:10000]\n",
    "    idx_to_char = list(set(corpus_chars))\n",
    "    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])\n",
    "    vocab_size = len(char_to_idx)\n",
    "    corpus_indices = [char_to_idx[char] for char in corpus_chars]\n",
    "    return corpus_indices, char_to_idx, idx_to_char, vocab_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_ugfju1n",
    "id": "20B12ED94FF145F089127459E501DC68",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 时序数据的采样\n",
    "\n",
    "在训练中我们需要每次随机读取小批量样本和标签。与之前章节的实验数据不同的是，时序数据的一个样本通常包含连续的字符。假设时间步数为5，样本序列为5个字符，即“想”“要”“有”“直”“升”。该样本的标签序列为这些字符分别在训练集中的下一个字符，即“要”“有”“直”“升”“机”，即$X$=“想要有直升”，$Y$=“要有直升机”。\n",
    "\n",
    "现在我们考虑序列“想要有直升机，想要和你飞到宇宙去”，如果时间步数为5，有以下可能的样本和标签：\n",
    "* $X$：“想要有直升”，$Y$：“要有直升机”\n",
    "* $X$：“要有直升机”，$Y$：“有直升机，”\n",
    "* $X$：“有直升机，”，$Y$：“直升机，想”\n",
    "* ...\n",
    "* $X$：“要和你飞到”，$Y$：“和你飞到宇”\n",
    "* $X$：“和你飞到宇”，$Y$：“你飞到宇宙”\n",
    "* $X$：“你飞到宇宙”，$Y$：“飞到宇宙去”\n",
    "\n",
    "可以看到，如果序列的长度为$T$，时间步数为$n$，那么一共有$T-n$个合法的样本，但是这些样本有大量的重合，我们通常采用更加高效的采样方式。我们有两种方式对时序数据进行采样，分别是随机采样和相邻采样。\n",
    "\n",
    "### 随机采样\n",
    "\n",
    "下面的代码每次从数据里随机采样一个小批量。其中批量大小`batch_size`是每个小批量的样本数，`num_steps`是每个样本所包含的时间步数。\n",
    "在随机采样中，每个样本是原始序列上任意截取的一段序列，相邻的两个随机小批量在原始序列上的位置不一定相毗邻。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "graffitiCellId": "id_07sma6k",
    "id": "31D18572438E4FE5856F1F24BB0C430C",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import random\n",
    "def data_iter_random(corpus_indices, batch_size, num_steps, device=None):\n",
    "    # 减1是因为对于长度为n的序列，X最多只有包含其中的前n - 1个字符\n",
    "    num_examples = (len(corpus_indices) - 1) // num_steps  # 下取整，得到不重叠情况下的样本个数\n",
    "    example_indices = [i * num_steps for i in range(num_examples)]  # 每个样本的第一个字符在corpus_indices中的下标\n",
    "    random.shuffle(example_indices)\n",
    "\n",
    "    def _data(i):\n",
    "        # 返回从i开始的长为num_steps的序列\n",
    "        return corpus_indices[i: i + num_steps]\n",
    "    if device is None:\n",
    "        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "    \n",
    "    for i in range(0, num_examples, batch_size):\n",
    "        # 每次选出batch_size个随机样本\n",
    "        batch_indices = example_indices[i: i + batch_size]  # 当前batch的各个样本的首字符的下标\n",
    "        X = [_data(j) for j in batch_indices]\n",
    "        Y = [_data(j + 1) for j in batch_indices]\n",
    "        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_edvdo82",
    "id": "5901DE6ECDA04E6F87E3FF2B0F697847",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "测试一下这个函数，我们输入从0到29的连续整数作为一个人工序列，设批量大小和时间步数分别为2和6，打印随机采样每次读取的小批量样本的输入`X`和标签`Y`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "graffitiCellId": "id_30cx4t4",
    "id": "5E8FDD7CDC6E43D884D4976DAB8A9191",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  tensor([[12, 13, 14, 15, 16, 17],\n",
      "        [18, 19, 20, 21, 22, 23]]) \n",
      "Y: tensor([[13, 14, 15, 16, 17, 18],\n",
      "        [19, 20, 21, 22, 23, 24]]) \n",
      "\n",
      "X:  tensor([[ 6,  7,  8,  9, 10, 11],\n",
      "        [ 0,  1,  2,  3,  4,  5]]) \n",
      "Y: tensor([[ 7,  8,  9, 10, 11, 12],\n",
      "        [ 1,  2,  3,  4,  5,  6]]) \n",
      "\n"
     ]
    }
   ],
   "source": [
    "my_seq = list(range(30))\n",
    "for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):\n",
    "    print('X: ', X, '\\nY:', Y, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_z179h28",
    "id": "B233BB0A25CA41A6998C3ECEE5B4A3FB",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### 相邻采样\n",
    "\n",
    "在相邻采样中，相邻的两个随机小批量在原始序列上的位置相毗邻。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "graffitiCellId": "id_6omgvln",
    "id": "8A18E8E0E87641EE84B8FE61A1D9ADD6",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):\n",
    "    if device is None:\n",
    "        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "    corpus_len = len(corpus_indices) // batch_size * batch_size  # 保留下来的序列的长度\n",
    "    corpus_indices = corpus_indices[: corpus_len]  # 仅保留前corpus_len个字符\n",
    "    indices = torch.tensor(corpus_indices, device=device)\n",
    "    indices = indices.view(batch_size, -1)  # resize成(batch_size, )\n",
    "    batch_num = (indices.shape[1] - 1) // num_steps\n",
    "    for i in range(batch_num):\n",
    "        i = i * num_steps\n",
    "        X = indices[:, i: i + num_steps]\n",
    "        Y = indices[:, i + 1: i + num_steps + 1]\n",
    "        yield X, Y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_5isfefi",
    "id": "B30A869DE4E3466891AAAEF45FCAC7F9",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "同样的设置下，打印相邻采样每次读取的小批量样本的输入`X`和标签`Y`。相邻的两个随机小批量在原始序列上的位置相毗邻。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "graffitiCellId": "id_7t0gy91",
    "id": "6C18F0586CE64AC290556069BD63B15A",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  tensor([[ 0,  1,  2,  3,  4,  5],\n",
      "        [15, 16, 17, 18, 19, 20]]) \n",
      "Y: tensor([[ 1,  2,  3,  4,  5,  6],\n",
      "        [16, 17, 18, 19, 20, 21]]) \n",
      "\n",
      "X:  tensor([[ 6,  7,  8,  9, 10, 11],\n",
      "        [21, 22, 23, 24, 25, 26]]) \n",
      "Y: tensor([[ 7,  8,  9, 10, 11, 12],\n",
      "        [22, 23, 24, 25, 26, 27]]) \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):\n",
    "    print('X: ', X, '\\nY:', Y, '\\n')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
